NCDSearch: Sliding Window-Based Code Clone Search Using Lempel-Ziv Jaccard Distance
Abstract
Software developers may write a number of similar source code fragments including the same mistake in software products. To remove such faulty code fragments, developers inspect code clones if they find a bug in their code. While various code clone detection methods have been proposed to identify clones of either code blocks or functions, those tools do not always fit the code inspection task, because a faulty code fragment may be much smaller than a code block, e.g., a single line of code. To enable developers to search for code clones of such a small faulty code fragment in a large-scale software product, we propose a method using Lempel-Ziv Jaccard Distance, which is an approximation of Normalized Compression Distance. We conducted an experiment using an existing research dataset and a user survey in a company. The result shows that our method efficiently reports cloned faulty code fragments and that its performance is acceptable for software developers.
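The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an LZ78-style phrase set per byte string, an exact Jaccard distance between those sets (published LZJD implementations typically approximate this with hashing), and a window that slides line by line with the same number of lines as the query. The names `lz_set`, `lzjd`, `search`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
def lz_set(data: bytes) -> set:
    """Collect the distinct phrases of an LZ78-style parse of `data`."""
    phrases = set()
    start, end = 0, 1
    while end <= len(data):
        piece = data[start:end]
        if piece in phrases:
            end += 1              # phrase already seen: extend it by one byte
        else:
            phrases.add(piece)    # new phrase: record it and start the next one
            start = end
            end = start + 1
    return phrases


def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the LZ phrase sets of `a` and `b` (0.0 = same sets)."""
    sa, sb = lz_set(a), lz_set(b)
    union = sa | sb
    if not union:                 # both inputs empty
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def search(query: str, lines: list, threshold: float = 0.5) -> list:
    """Slide a window over `lines` and return (1-based start line, distance)
    for every window whose distance to `query` is at most `threshold`.

    Assumption: the window is as many lines tall as the query.
    """
    q = query.encode("utf-8")
    width = max(1, len(query.splitlines()))
    hits = []
    for i in range(len(lines) - width + 1):
        window = "\n".join(lines[i:i + width]).encode("utf-8")
        d = lzjd(q, window)
        if d <= threshold:
            hits.append((i + 1, d))
    return hits


if __name__ == "__main__":
    source = [
        "int x = 0;",
        "if (p = NULL) return;",
        "x++;",
    ]
    # Search for a single faulty line (assignment instead of comparison).
    for line_no, dist in search("if (p = NULL) return;", source, threshold=0.3):
        print(line_no, round(dist, 3))
```

A lower `threshold` reports only near-identical fragments; raising it trades precision for recall, which is the kind of tuning a fragment-level search like this would need.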
Related resources
Lempel-Ziv Compression in a Sliding Window
We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log σ) deterministic time. Here, w is the window size, n is the size of the input string, z is the ...
Lempel-Ziv Jaccard Distance, an Effective Alternative to Ssdeep and Sdhash
Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to measure the similarity between binary byte sequences for malware classification. We propose and test LZJD’s effectiveness as a similarity digest hash for digital forensics. To do so we develop a high performance Java implementation with the same command-line arguments as sdhash, making it easy to integrate into exist...
Lempel-Ziv Factorization: LZ77 without Window
To construct the suffix array of a string S boils down to sorting all suffixes of S in lexicographic order (also known as alphabetical order, dictionary order, or lexical order). This order is induced by an order on the alphabet Σ. In this manuscript, Σ is an ordered alphabet of constant size σ. It is sometimes convenient to regard Σ as an array of size σ so that the characters appear in ascending ...
Lempel-Ziv Dimension for Lempel-Ziv Compression
This paper describes the Lempel-Ziv dimension (Hausdorff like dimension inspired in the LZ78 parsing), its fundamental properties and relation with Hausdorff dimension. It is shown that in the case of individual infinite sequences, the Lempel-Ziv dimension matches with the asymptotical Lempel-Ziv compression ratio. This fact is used to describe results on Lempel-Ziv compression in terms of dime...
On Match Lengths, Zero Entropy and Large Deviations - with Application to Sliding Window Lempel-Ziv Algorithm
The Sliding Window Lempel-Ziv (SWLZ) algorithm that makes use of recurrence times and match lengths has been studied from various perspectives in information theory literature. In this paper, we undertake a finer study of these quantities under two different scenarios, i) zero entropy sources that are characterized by strong long-term memory, and ii) the processes with weak memory as described ...
Journal
Journal title: IEICE Transactions on Information and Systems
Year: 2022
ISSN: 0916-8532, 1745-1361
DOI: https://doi.org/10.1587/transinf.2021edp7222